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A substantial focus of research in molecular biology are gene reg- 
ulatory networks: the set of transcription factors and target genes 
which control the involvement of different biological processes in liv- 
ing cells. Previous statistical approaches for identifying gene regula- 
tory networks have used gene expression data, ChIP binding data 
or promoter sequence data, but each of these resources provides 
only partial information. We present a Bayesian hierarchical model 
that integrates all three data types in a principled variable selection 
framework. The gene expression data are modeled as a function of 
the unknown gene regulatory network which has an informed prior 
distribution based upon both ChIP binding and promoter sequence 
data. We also present a variable weighting methodology for the prin- 
cipled balancing of multiple sources of prior information. We apply 
our procedure to the discovery of gene regulatory relationships in 
Saccharomyces cerevisiae (Yeast) for which we can use several ex- 
ternal sources of information to validate our results. Our inferred 
relationships show greater biological relevance on the external vali- 
dation measures than previous data integration methods. Our model 
also estimates synergistic and antagonistic interactions between tran- 
scription factors, many of which are validated by previous studies. 
We also evaluate the results from our procedure for the weighting for 
multiple sources of prior information. Finally, we discuss our method- 
ology in the context of previous approaches to data integration and 
Bayesian variable selection. 
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1. Introduction and motivation. The development and function of living 
cells is, to a large extent, dictated by a carefully choreographed system of 
gene expression. Gene expression is controlled in part by transcription fac- 
tors (TFs), a class of proteins which bind to DNA leading to an increase 
or decrease in transcription of target genes. The collection of transcription 
factors and their targets (genes that they control) is called a regulatory net- 
work. In this work, we develop a model for understanding the transcriptional 
regulatory networks that specify and maintain cellular function. 

The computational approaches that are used to identify regulatory net- 
works have typically used information from three different sources: 

1. Gene expression data. Microarray chips are used to measure the levels of 
mRNA produced for each gene in a cell, which is usually referred to as 
the amount of gene expression. Since mRNA is a precursor to the protein 
product of each gene, expression levels are used as proxy for the amount 
of protein produced. Genes which show similar levels of expression in 
different conditions are believed to be co-regulated. 

2. ChIP binding data. Chromatin Immunoprecipitation technology uses an- 
tibodies to isolate sequences that are directly bound by a specific tran- 
scription factor. Microarray chips are then used to chart these sequences 
within the genome in order to determine potential locations for binding 
of that particular transcription factor. 

3. Promoter sequence data. The different binding sites (located near different 
target genes) of the same transcription factor show a significant sequence 
conservation, but substantial variability is also present. The conserved 
appearance of the transcription factor binding sites is summarized by a 
position-specific weight matrix (PWM) which can be used to search near 
to potential target genes for the predicted binding sites of a transcription 
factor. The strength of the signal in these PWMs varies substantially 
between transcription factors. 

Although each of these resources are extremely useful, their power is in- 
herently limited by the fact that each type of data provides only partial 
information: expression data provide only indirect evidence of regulation, 
promoter sequence data provide only potential binding sites which may 
not be bound by TFs and ChIP binding data provide only physical bind- 
ing locations which may not be functional in terms of controlling gene 
expression. We develop a Bayesian hierarchical model for combining our 
three available sources of information: gene expression data, ChIP bind- 
ing data and promoter sequence data. This is accomplished by extending 
previous linear models for gene expression data [Bussemaker, Li and Siggia 
(2001), Gao, Foat and Bussemaker (2004)] into a variable selection frame- 
work. There has been substantial research into Bayesian approaches to vari- 
able selection, though as mentioned by George (2000), most previous meth- 
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ods have focused on minimization of prior dependence. In contrast, our ap- 
proach takes advantage of two additional data sources, ChIP binding and 
promoter element data, to generate informed prior distributions for our vari- 
able selection model. We also develop a variable weighting methodology for 
balancing these two sources of prior information. 

There has also been substantial previous research into the integration 
of biological data sources for the discovery of regulatory networks. 
Bussemaker, Li and Siggia (2001) developed a linear model to reflect the 
correlation between expression patterns and cis-regulatory motif abundance, 
with the inherent drawback that any synergistic effects from transcription 
factor interaction were not taken into account. Tadesse, Vannucci and Lio 

(2004) posited a similar linear model between expression patterns and cis- 
regulatory motif abundance, but used Bayesian variable selection instead 
of the stepwise regression procedure of Bussemaker, Li and Siggia (2001). 
Gao, Foat and Bussemaker (2004) presented an integrated linear model, MA- 
Networker, for combining expression and ChIP binding data, but their pro- 
cedure required a stringent binding p-value threshold. Banerjee and Zhang 
(2003) used thresholded ChIP binding data and gene expression data to iden- 
tify cooperativity among TFs. Our model also allows us to estimate synergis- 
tic and antagonistic interactions between transcription factors. Xing and Laan 

(2005) developed a multiple linear regression model selected by a loss-based 
V-fold cross-validation selector, but the method relies on knowledge of known 
TF sites and the number of TFs with known consensus binding sites is small 
and their functional coverage is somewhat limited. Based on the assump- 
tion that the expression levels of regulated genes depend on the expression 
levels of regulators, Segal (2001, 2003) constructed a probabilistic model 
which used binding motif features and expression data to identify modules 
of co-regulated genes and their regulators. This probabilistic model reflected 
nonlinear properties, but required prior clustering of the expression data. 
The GRAM algorithm combining ChIP binding and expression data was 
developed by Bar-Joseph et al. (2003) to discover regulatory networks in 
Saccharomyces cerevisiae, but their technique is heuristic with arbitrary pa- 
rameter thresholds and little systematic modeling. Another threshold-based 
approach is taken by Lemmens et al. (2006) in their ReMoDiscovery method. 

Our full probabilistic model does not rely on pre-clustering of expression 
data and reduces dependence on arbitrary parameter cutoffs. As mentioned 
by Kloster, Tang and Wingreen (2005), many current methods rely on the 
basic assumption that each gene can only belong to a single cluster. Our 
framework permits genes to belong to multiple regulatory clusters, which al- 
lows us to model multiple biological pathways simultaneously. Other recent 
efforts [Liao et al. (2003), Yang (2005), Boulesteix and Strimmer (2005)] 
have used a "network component analysis" approach to find regulatory mod- 
ules using expression data. In these investigations, ChIP binding data are 
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used to form a connectivity network between genes and TFs, which is as- 
sumed to be known without error. In contrast, our model allows for the 
inherent uncertainty in ChIP experiments, which allows for a more direct in- 
tegration of ChIP binding and gene expression, but uses TF gene expression 
as our measure of TF activity. Sabatti (2005) also use the "network compo- 
nent analysis" approach to model gene expression using a prior distribution 
based on promotor binding sites, which did allow for some uncertainty in 
the sequence information, but did not include any ChIP binding data. We 
will revisit the distinction between our approach and "network component 
analysis" in our discussion. 

In Section 2 we outline our Bayesian variable selection model for inte- 
grating multiple data sources and discuss implementation using a Gibbs 
sampling algorithm [Geman (1984)]. In Section 3 we describe an applica- 
tion of our methodology to Saccharomyces cerevisiae (Yeast) where gene 
expression, ChIP binding data and promoter sequence data are available for 
many transcription factors. We validate our results in yeast using external 
information from the biological literature and compare to several alterna- 
tive methods. In a related paper [Chen, Jensen and Stoeckert (2007)] we 
present a reduced application of our model to Yeast, as well as applications 
in the higher organism Mus musculus (mouse). We explore several interest- 
ing model consequences and sensitivities in Section 4. Finally, in Section 5 we 
discuss our model in the context of previous methods for regulatory network 
elucidation, as well as previous approaches to variable selection. 

2. Bayesian model and implementation. The primary goal of our sta- 
tistical model is to infer probable gene-TF relationships through the in- 
tegration of available biological data. Mathematically, we formulate these 
relationships as unknown indicator variables Cij = 1 if gene i is regulated 
by TF j or otherwise. Our inference for these regulation indicators Cy is a 
variable selection process that determines which subset of the many possible 
gene-TF relationships are biologically important and allows us to construct 
an inferred regulatory network. This network can be visually represented as 
a graph where nodes are genes and TFs, and each Cij variable determines 
whether or not there should be a directed edge connecting the node for TF 
j with the node for gene i. Collectively, the matrix C of these indicator 
variables also gives us regulatory clusters (also called regulatory modules) 
for each TF, since all genes i where Cij = 1 are estimated to be in a cluster 
together regulated by TF j. An important aspect of our flexible framework 
is that we are explicitly allowing genes to belong to multiple clusters con- 
trolled by different transcription factors (i.e., Cij = 1 and Cj,/ = 1 for j ^ j'). 
In order to infer likely values for our indicator variables C, our model incor- 
porates up to three general classes of biological information: gene expression 
data, ChIP binding data and sequence-level promoter data. 
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We denote our gene expression data as gu, the expression of gene i 
(i = 1, . . . , N) in experiment t (t = 1, . . . , T). The set of T experiments can 
be from different tissues, time-course experiments, different gene-knockout 
experiments, or any combination thereof. Within these expression data, we 
give special focus to the expression of genes that produce known transcrip- 
tion factor proteins. For our J known transcription factors, we denote fjt 
as the expression of TF j (j = 1, . . . , J) in experiment t. The TF expression 
levels fjt are derived from our gene expression data by simply identifying the 
gene that encodes each transcription factor j, and using the expression level 
of that gene as our TF expression levels. We will use the expression fjt for 
the gene that produces TF j as a proxy for the amount of activity of TF j . 
In addition to expression data, we have available Chromatin Immunoprecip- 
itation (ChIP) experiments which give information on the physical binding 
location of specific transcription factors. We use by to denote the probabil- 
ity that transcription factor j physically binds in close proximity to gene i, 
from a ChIP binding experiment for transcription factor j. Finally, we have 
available sequence-level information in the form of known or putative pro- 
moter binding sites for specific transcription factors located in the upstream 
regions of target genes. We denote 777/2 j the probability that transcrip- 
tion factor j has a promoter binding site in the regulatory region of gene i. 
These binding sites could be experimentally verified or predicted by scanning 
upstream sequences for similarity to an established position-specific weight 
matrix (PWM) for a particular transcription factor. We outline our model 
in the most general case where all three of these data types are present, but 
we will also discuss the ramifications on our procedure when only subsets of 
these data types are available. Our different data sources are summarized in 
Table 1. 

The first level of our probabilistic model incorporates our gene expression 
data by specifying the observed gene expression ga as a linear function of 
TF expression, fjt, 

J 

(1) g it = Ui + ^2(3jCijfj t + eit, t it ~Normal(0,<r 2 ). 

3=1 



Table 1 

Notation for available data sources 



Notation 


Data type 


9it = 


expression of gene i in experiment t 


Gene expression 


fit = 


expression of TF j in experiment t 


TF expression 


hj = 


probability that TF j binds near to gene i 


ChIP binding 


rriij = 


= probability that gene i has promoter element for TF j 


Promoter sequence 



G 



S. T. JENSEN, G. CHEN AND C. J. STOECKERT 



In equation (1) we see that our regulation indicators Cjj act as variable 
selection parameters: only TFs j where Cj,- = 1 are allowed to influence the 
expression of gene i. The parameter (5j is the linear effect of TF j on gene 
expression, whereas o.i can be interpreted as the baseline expression for gene 
i in absence of regulation by known transcription factors (i.e., Cj,- = for 
all TFs j). Bussemaker, Li and Siggia (2001) also used a linear model for 
expression data, except that their approach did not use TF expression fj t 
as a proxy for TF activity, but rather used sequence elements as their proxy 
for TF activity. We prefer the use of TF expression as our proxy for TF 
activity since our TF expression levels fjt are specific to each experiment 
t in the same way as our gene expression levels ga- Sequence information 
is not experiment or condition-specific and so is less useful as a proxy for 
TF activity. However, we do make use of sequence elements in our prior 
distribution (3) for the global regulation indicators Cy. 

Our simple linear model, as stated in (1), is limited by not allowing for 
combinatorial relationships between TFs. Each TF j has a single effect {(5j) 
on the expression of gene i, which does not take into account the biological 
reality that expression is often the result of synergistic or antagonistic action 
of multiple TFs binding simultaneously. We acknowledge these combinatorial 
relationships by expanding our linear model to include interaction terms: 

J 

(2) git = OLi + ^2 PjCijfjt + X! IjkCijCikfjtfkt + £it> 

J'=l j^k 

e it ~ Normal(0, a 2 ), 

where we now have additional coefficients jjk that can be interpreted as the 
synergistic (or antagonistic) effect of both TFs j and k binding together to 
the same upstream region (in addition to the effects of TF j or k binding 
in isolation). Note that our regulation indicators Cy again act as variable 
selectors for both the linear and interaction terms in equation (2). Of course, 
higher-order interactions or nonlinear functions could also be considered in 
our framework. However, this additional model complexity would increase 
the parameter space and computation burden of the model dramatically. 
We believe that our extended model with TF interactions (2) achieves an 
appropriate balance between the computational cost of model fitting and 
the flexibility to adequately model TF-gene expression relationships. 

Despite the intuitive appeal of positing linear models [Bussemaker, Li and Siggia 
(2001), Tadesse, Vannucci and Lio (2004), Gao, Foat and Bussemaker (2004)] 
as a variable selection problem, the implementation of our variable selection 
model is quite complex in practice, with a large number of both genes i 
(e.g., 6026 in our yeast application) and TFs j (e.g., 39 in our yeast appli- 
cation). We address this complexity by using our additional data types to 
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construct informed prior distributions for each regulation indicator Cij. We 
have bij, the probability that TF j physically binds in proximity to gene 
i in a ChlP-binding experiment, and m«, the probability of a binding site 
for TF j in the upstream region of gene i. The second component of our 
model incorporates both bij and rriij into a combined prior distribution for 
our unknown regulation indicators C^: 

(3) ,,«:,„,„. !>,,.„■:) x [bf;>(i - ^f-^r • ><r(' - ™«) ] r 1 "' • 

The variable Wj is the relative weight of the prior ChlP-binding information 
bij versus the TF binding site information rriij. The weights w = (w\, . . . , wj) 
are TF-specific but not gene-specific, and are designed to reflect potential 
global differences in quality between the binding data and promoter sequence 
data for TF j. However, since this relative quality is not necessarily known 
a priori, we will treat each weight Wj as an unknown variable. Clearly, if 
only ChIP binding data for TF j are available, then Wj = 1 and equation (3) 
reduces to a function of fry only, whereas if only promoter sequence data for 
TF j are available, then Wj = and equation (3) reduces to a function of rriij 
only. In cases where both data types are available, our model will estimate 
the weight Wj so that our prior distribution moves toward our likelihood 
based on expression data, thereby creating an appropriate balance between 
the two sources of prior information. 

The Bayesian approach gives us a principled framework for connecting 
these model components into a single posterior distribution for all unknown 
parameters: 

p(C,w,Q\g,f,m,b)ocp(g\f,C,@) ■ p{C\m,b,w) ■ p{&,w), 

where denotes the collection of linear model parameters, that is, = 
(a, (3, 7, a 2 ). The term p(g\f, C, 0) represents our first model level with ex- 
pression data g = (go) and / = (fjt) and p(C\m, b, w) represent our second 
model level with ChIP binding data b= (bij) and promoter sequence data 
m = (rriij). All that remains is the specification p(@,w), the prior distri- 
butions for our TF-specific prior weights w = (w±, . . . ,wj) and our linear 
model parameters 0: 

(a) baseline gene i expression: a» ~ Normal(0, t 2 ), 

(b) TF linear effects: 0j ~ Normal(0,r|), 

(c) TF interaction effects: 7-,-/% ~ Normal(0, t 2 ), 

(d) residual gene expression variance: a 2 ~ Inv— xt, 

(e) prior distribution weights: Wj ~ Uniform(0, 1). 

In Section 4.2 we discuss choices of these hyper-parameters r and v that 
are noninfluential on our posterior inference. We estimate the joint poste- 
rior distribution of all unknown parameters by Markov chain Monte Carlo 
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simulation. Specifically, we use Gibbs sampling [Geman (1984)], where we 
iteratively sample values of one set of parameters given all other parameters: 

1. Sampling given C,w and data g,f,b,m. 

2. Sampling C given w, and data g, f,b,m. 

3. Sampling w given C, and data g, f, b, m. 

The details of our Gibbs sampling implementation are given in the Appendix. 
Software for our procedure is available for download at 
http : //www . cbil . upenn . edu/COGRIM/. 

3. Application to the yeast regulatory network. We applied our model 
to extensive available data for the simple organism Saccharomyces cere- 
visiae (budding yeast). We used 314 gene expression experiments, each in- 
volving 6026 yeast genes. A detailed reference list for our expression data 
sources and description of some preliminary data cleaning and manipulation 
is given in the supplemental materials. We also have both ChIP binding data 
[Lee et al. (2002)] and promoter element data [Matys et al. (2003)] for 39 
yeast transcription factors. Thus, the dimension of our observed expression 
data g = (go) is 6026 genes x 314 conditions, while / = (fjt) is 39 TFs 
x 314 conditions. The dimensions of our observed binding data b = (bij) 
and m = (rriij) are both 6026 genes x 39 TFs. Our supplemental materials 
also contain a detailed evaluation of the convergence of our Gibbs sampling 
algorithm. 

In Section 3.1 below we examine our posterior results for the regulation 
indicators C in our model, which are the primary goal of our investiga- 
tion. We use available external information from the biological literature 
to confirm our inference and compare to previous methods. In Section 3.2 
we present additional results for the interaction between TFs in our yeast 
application where the regulatory actions of many TFs are being modeled si- 
multaneously. Finally, in Section 3.3 we examine posterior inference for our 
weighting parameters between the two different sources of available prior in- 
formation for each Yeast transcription factor. A reduced form of our model 
is applied to Yeast and two transcription factors in the higher organism Mus 
musculus (mouse) in a related paper [Chen, Jensen and Stoeckert (2007)]. 

3.1. Inference for regulation indicators C. The samples of each indi- 
cator variable Cij from our Gibbs sampling algorithm were used to esti- 
mate the posterior probability P(Cy = 1) for each possible gene i and TF 
j relationship. We considered any (i,j) combination with posterior prob- 
ability P(Cij = 1) higher than 0.5 as an inferred gene-TF relationship, 
and we then call the ith gene a target gene of transcription factor j. For 
our yeast application, we focus on the inferred target genes for 39 tran- 
scription factors where external validation measures of biological relevance 
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are available. We also give a visual representation of the regulatory net- 
work in our supplemental materials. We use two validation measures, TF 
knockout data and MIPS functional categories, to compare the inferred 
gene sets from our full model involving all three data sources to reduced 
forms of our model that only involve subsets of our data sources. With 
these same two validation measures, we also compare our model inference to 
the inferred target genes produced by several previous integration methods: 
GRAM [Bar-Joseph et al. (2003)], ReMoDiscovery [Lemmens et al. (2006)] 
and MA-Networker [Gao, Foat and Bussemaker (2004)]. Finally, we com- 
pare our model results to target gene sets constructed based on heuristic 
thresholds of single data sources used in isolation. Following the recommen- 
dation of Lee et al. (2002), we use thresholded ChlP-binding data alone by 
classifying any genes with binding p- values less than 0.001 as gene targets. 
We also use thresholded expression data alone by calculating the pairwise 
correlation between gene expression g i and the expression fj of TF j, and 
classifying the most correlated 1% of genes as targets. This 1% threshold 
gave the best performance among several different thresholds that we con- 
sidered. 

Our most reliable validation uses the results of TF knockout experiments 
from the Rosetta Yeast Compendium [Hughes et al. (2000)] for four yeast 
TFs: Yapl, Swi4, Swi5 and Gcn4. Knockout experiments are considered a 
gold standard for the regulatory activity of individual transcription factors. 
In each of these experiments, a knockout strain of yeast was created with 
a specific TF removed from the genome. Microarray chips are then used to 
quantify the knockout response for each gene: the change in expression for 
each gene between the knockout and wild-type strains. Genes that are targets 
of the knocked-out TF should show greater knockout response between the 
wild-type strain and the knock-out strain. 

Within each TF knockout experiment, we calculated a t-statistic for the 
knockout response for genes inferred to be targets by each method, which are 
shown in Figure 1. Methods with larger t-statistic values in Figure 1 show a 
greater knockout response within their inferred target genes, which supports 
the biological relevance of that method. For each TF experiment, our model 
using expression data only ("Exp") is clearly inferior to our model with 
multiple data sources ("All 3" and "ExpChIP"). Our model based on all 
three data sources ("All 3") shows similar performance to our model with- 
out promoter sequence data ("Exp+ChlP"), which suggests that this third 
data source is not contributing substantially to inference. We will revisit 
this issue when we examine our variable weight inference in Section 3.3. 
The inferred target genes from our integrated models ("All 3" and "Ex- 
pChIP") show uniformly superior performance across the four experiments, 
suggesting that our full probabilistic model is capturing more signal than 
previous integrated methods (MA-Networker, GRAM and ReMoDiscovery). 
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The inferred target genes based on thresholded single data shows consider- 
ably worse performance, which demonstrates that an integrated approach 
based on multiple data sources leads to superior inference for regulatory 
networks. 

As a second validation, we used the MIPS database [Mewes (2002)] to as- 
sign a functional category to each gene in our yeast application. For each of 
our 39 TFs, we looked for over-represented functional categories within each 
set of inferred target genes. A set of putative gene targets that share simi- 
lar gene functions are likely to be involved in the same biological pathway, 
which validates the inference that they are regulated by a common tran- 
scription factor. Any functional category with a p-value of less than 0.001 
(p-value calculated using the hypergeometric distribution) was considered 
to be significantly over-represented. The proportion of inferred target genes 
that shared over-represented functional categories (averaged across the 39 
TFs) was calculated for our inferred targets, as well as the inferred targets 
from other methods. We compare the average proportion of over-represented 
functions between methods in Figure 2. As observed in our knockout vali- 
dation, our model using expression data only ("Exp Only") does not per- 
form nearly as well as our model with multiple data sources ("All 3" or 
"Exp+ChIP"). Our model without promoter data ("Exp+ChIP") actually 
performs better than our model with all three data sources ("All 3"), which 
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is also seen in a subset of the knockout experiments above. Both of these in- 
tegrated versions of our model have a higher proportion of over-represented 
functions compared to previous integrated methods, though the performance 
for all integrated methods are quite similar. The inferred target genes from 
each of the integrated methods show substantially greater functional over- 
representation than inferred target genes from the thresholding of a single 
data source, which again confirms that combining multiple data sources can 
improve inference. An interesting side note is that the version of our model 
using expression data alone gives better performance compared with using 
thresholded expression data. This result suggests that our model for expres- 
sion data captures additional signal compared to a threshold approach even 
without integrating additional data sources, though the integrated versions 
of our model give even better results. 



3.2. Inference for linear model parameters. Our yeast application in- 
volves the simultaneous modeling of multiple transcription factors, which 
also allows us to infer the partial linear effects (3 of individual transcription 
factors as well as interaction effects 7 between pairs of transcription factors. 
We consider a particular parameter (3j or 7^ as significant if their 95% pos- 
terior interval does not contain zero. It should be noted that we are actually 
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Fig. 2. Average proportion of genes with over-represented functions. Height of bars rep- 
resents the average proportion of over-represented functions, while lines represent the stan- 
error of the average. 
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examining the posterior distribution of each f3j and 7-,/% parameter condi- 
tional on our regulatory indicators Cj,- = 1, since any genes where Cy = 
make no contribution to the conditional distribution of (3j, as seen in equa- 
tion (6) of the Appendix. Thus, the parameters j3j should be interpreted as 
the linear effect of TF j on gene expression if TF j is a regulator of the 
gene. Similarly, the parameter 7^ should be interpreted as the interaction 
effect of TFs j and k on gene expression if TFs j and k are both regulators 
of the gene. 

Among the linear effects (3, we found sixteen activators (significantly pos- 
itive /3/s) and one repressor (significantly negative fij), which are listed 
in the supplemental materials. Fourteen of the sixteen activators and the 
RME1 repressor discovered by our model were previously reported in the 
SGD database [SGD project (2005)], which gives further evidence that our 
method is very effective at distinguishing appropriate regulatory relation- 
ships. Our model also identified 196 TF pairs which had significant interac- 
tion parameters jjk- Using our inferred regulation indicators C, we imposed 
an additional restriction that each significant pair of TFs had to also share 
at least four target genes in common, which resulted in a reduced set of 
84 TF pairs. A substantial subset of these 84 TF pairs discovered by our 
model are also validated by previous biological studies, as outlined in the 
supplementary materials. 

3.3. Inference about weighting parameters. A novel component of our 
proposed methodology was the introduction of a weighting variable Wj which 
balances the relative quality of the prior distribution based on the ChIP 
binding data versus the prior distribution based on the promoter sequence 
data for each TF j individually. Figure 3 gives a boxplot representation of 
the posterior distributions of the weight variables Wj for all 39 transcription 
factors. 

We see a substantial amount of heterogeneity between each weight vari- 
able, which reflects differences in the quality of available data for different 
transcription factors. We also observe that the posterior distributions for 
nearly all of these transcription factors are centered around values substan- 
tially higher than 0.5, which suggests that the ChIP binding data is being 
favored as the superior source of prior information for our variable selec- 
tion indicators C. The most extreme example is the RME1 transcription 
factor, where essentially all of the posterior mass for Wj is greater than 0.8. 
This general trend matches the common perception of practitioners that a 
ChIP binding experiment will provide better evidence of regulation than 
predictions based on sequence data. Other examples include the four tran- 
scription factors (indicated by a "K" symbol in Figure 3) for which we have 
TF knockout data. In Section 3.1 we noticed that the knockout response 
was generally similar between our full model with all three data sources 
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Fig. 3. Posterior distributions of weight variables. 

and a reduced model without the promoter sequence data (wj set equal to 
1 for all j). Now we see that this similar performance is expected, consid- 
ering that the distribution of Wj in the full model is centered quite close 
to one anyways. It should be noted, however, that this phenomenon is not 
uniform across all transcription factors. Not all posterior distributions of 
Wj are pushed toward the boundary value of 1, and in a few cases, such as 
MSN4, include some posterior mass less than 0.5, which is evidence that the 
promoter sequence data is also making a contribution to inference. 

4. Model sensitivity and consequences. 

4.1. Network sparsity. Most gene regulatory networks are inherently quite 
sparse with only a small subset of all genes controlled by any one transcrip- 
tion factor. In terms of our parameterization, this concept translates into 
an expectation that, for any j, only a small number of genes i will have 
Cij = 1. There are a variety of variable selection methods that enforce spar- 
sity on the selection space, such as the lasso [Tibshirani (1996), Efron et al. 
(2004)]. In the Bayesian variable selection approach, one can also incorpo- 
rate sparsity by using a small prior probability on the selection indicators, 
p(Cij = 1) = q, where a is small (e.g., a = 0.01). In our model, we do not 
have a constant prior probability a for each selection indicator. Rather, we 
have specific prior probabilities based our the ChIP and sequence data, as in 
(3). However, as seen in Figure 4, these probabilities bij and rriij tend to be 
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quite small themselves. To further investigate the sparsity in our model, we 
repeatedly generated regulatory networks C from our prior distribution (3). 
From these repeatedly (m = 10000) generated networks, we estimated the 
probability P(Cy = 1) for each gene i and gene j, and tabulated the number 
Nj of inferred target genes for each TF j [genes i with P{Cij = 1) > 0.5]. 
This procedure is analogous to the inference from our full model, but only 
uses the prior probabilities based on ChIP and sequence data. This entire 
experiment was repeated for different values (ranging from 0.05 to 0.95) of 
each weight parameter Wj, so that we have a range of Nj values for each 
TF j depending on the different values of Wj. Figure 5 shows a boxplot 
that indicates the range of Nj values over all values Wj for each of our 39 
transcription factors. We see that the number of inferred target genes Nj 
for each TF j is quite small relative to the total number of genes in the net- 
work (~6000). These results demonstrate that our prior distribution on the 
network selection indicators Cij is capturing our prior expectations that the 
Yeast regulatory network should be relatively sparse. We also see substantial 
differences between TFs in terms of the variability of the number of target 
genes Nj, which is indicative of significance between-TF variability in the 
response of the inferred target gene set to changing values of the weight w. 
This result provides further motivation for the use of TF-specific weights Wj 
that balance the ChIP and sequence motif data. Finally, we also included 
the number of inferred genes Nj from the posterior distribution of our full 
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Fig. 4. Distribution of 6y and rriij values. 
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Fig. 5. Boxplots of distribution of Nj for each TF j. Black dots are number of inferred 
genes from full model. 



model (black dots) in Figure 5. We see that our posterior inference differs 
substantially from our prior inference for most transcription factors, but the 
overall sparsity of the network is maintained in our posterior distribution. 

4.2. Sensitivity to prior specification. Our prior distribution p(Cij\m,ij,bij, 
Wj) for each regulation indicator Cij is designed to balance the influence of 
the ChIP binding probability b%j and promoter element probability m%j . This 
balance is achieved in equation (3) by using a weighted geometric mean. We 
also explored alternative prior specifications that balance our ChIP bind- 
ing and promoter element data sources. Specifically, we also considered a 
prior distribution for Cij based on the arithmetic mean of our ChIP binding 
probability bij and promoter element probability rriij, 

p(Cij\mij,bij,Wj) 

( 4 ) 

= [wjbij + (1 - w^mijf^ [1 - (wjbij + (1 - w^rriij)} 1 Cij . 

For all 39 TFs, we explored differences between the prior probabilities us- 
ing equation (4) to the prior probabilities using equation (3). Specifically, 
we examined the difference between the two prior distributions in terms of 
the number of a priori inferred target genes for each TF [i.e., number of 
genes i with p(Cy = l[my, bij,Wj) > 0.5 for each TF j]. For this calcula- 
tion, we needed to assume a reasonable value for each weight Wj, so we used 
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the posterior mean of each Wj from Section 3.3. Only three TFs (ABF1, 
RAP1, REB1) showed a substantial difference in the number of a priori 
inferred target genes between the two priors, though these three TFs also 
had the largest total number of a priori inferred target genes among all 39 
TFs. For the remaining TFs, there was very little difference in the number 
of a priori inferred target genes. We also examined the differences in the 
a posteriori inferred target genes for the transcription factor HAP4, and 
found that the list of inferred target genes was quite similar regardless of 
whether our original prior (3) or the alternative prior (4) were implemented. 
We evaluated the small differences between the inferred gene lists using our 
functional categories validation measure (Section 3.1), and found that the 
inferred target genes using our original prior (3) gave a slightly higher pro- 
portion of over-represented functional categories. Given the observed lack 
of substantial difference between the two prior formulations, and since the 
specification (4) is a more complicated functional form to implement in our 
Gibbs sampler, we prefer the use of our original prior specification (3). 

Another issue is the potential sensitivity of our posterior inference to the 
specified prior distributions for the parameters (a, (3,j,a 2 ) which appear in 
the linear model for the expression data (1). The influence of the prior dis- 
tributions given in Section 2 depends on the values of the hyper-parameters 
(v,t 2 ,Tq,t 2 ). The nature of the dependence is clear in the conditional dis- 
tribution formulas in Appendix. The influence of the prior distribution on 
posterior inference for a 2 is very small when v is small. The prior distri- 
butions for the regression coefficients a, (3 and 7 can also be made nonin- 
fluential by making the prior variance hyperparameters t„,tJ and t 2 very 
large. Our posterior results given in Section 3 are based on values of v = 2 
and t 2 = rj = t 2 = 10000. In many variable selection problems with a large 
but sparse covariate space, more informative prior distributions on the re- 
gression coefficients are used that enforce shrinkage toward values of zero 
(exclusion of variables). However, that is not necessary in this case, since we 
have enforced sparsity in our model directly through the selection indicator 
variables C, as detailed in Section 4.1. 

5. Discussion. We have presented a Bayesian hierarchical model for com- 
bining heterogeneous sources of biological data to infer regulatory rela- 
tionships between genes and transcription factors. Within a variable selec- 
tion framework, we build upon previous linear models for gene expression 
data [Bussemaker, Li and Siggia (2001), Tadesse, Vannucci and Lio (2004), 
Gao, Foat and Bussemaker (2004)] by allowing interactions between tran- 
scription factors and incorporating additional information about regulation 
based on other data sources. The Bayesian paradigm allows us to incorporate 
these additional data sources in a natural way through the use of prior dis- 
tributions for our variable selection indicators. This variable selection model 
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also permits genes to belong to multiple regulatory clusters, which allows us 
to model multiple biological pathways simultaneously. Our full probabilistic 
model does not rely on any pre-clustering of our data and reduces depen- 
dence on arbitrary parameter cutoffs compared to previous methods [e.g., 
Liao et al. (2003), Yang (2005), Boulesteix and Strimmer (2005)]. When ap- 
plied to available data in Saccharomyces cerevisiae (Yeast), the inferred re- 
lationships from our model with multiple data sources were shown to be 
biologically relevant using external validation measures, with substantially 
better performance compared with predictions from previous methods (MA- 
Networker, GRAM and ReMoDiscovery) , as well as predictions from thresh- 
olding of a single data source. In addition to inferring gene-TF relationships, 
our model also estimated synergistic and antagonistic interactions between 
transcription factors, many of which were also validated by previous stud- 
ies. 

The use of informative vs. noninformative prior distributions is a topic 
of continued discussion within the Bayesian statistical community. Nonin- 
formative prior distributions are often used in the context where very lit- 
tle prior information is known, but the researcher still prefers a Bayesian 
inferential approach for their applied problem. In other cases, prior infor- 
mation is known about the applied problem, in which case the Bayesian 
paradigm provides a natural way to build this additional information into 
the probability model. Our current methodology provides a pragmatic com- 
promise of these two approaches: we use informed prior distributions for 
our primary inferential targets, the regulation indicators C, but our model 
also involves noninformative prior distributions for parameters of secondary 
interest, such as the coefficients of our linear model for expression data. 
Our approach of building additional data sources into our model via an 
informed prior distribution for our regulation indicators C contrasts with 
most previous Bayesian variable selection research, where criteria are used 
that assume noninformative prior distributions or avoid prior specification 
entirely. See [George (2000)] for a review of these noninformative methods 
and [George and McCulloch (1996)] for a hierarchical Bayesian variable se- 
lection model using noninformative prior distributions. 

In some previous cases, prior knowledge is incorporated into variable se- 
lection, as in the regression model of Garthwaite and Dickey (1996), the 
logistic regression model of Chen, Ibrahim and Yiannoutsos (1999) and the 
generalized linear mixed models of Chen et al. (2003). Even more related to 
our application, Sabatti (2005) used an informed prior distribution based 
on binding site data to model regulatory networks. However, in their model, 
only regulation indicators with strong prior evidence are allowed to be 
nonzero, so that a gene-TF relationship without prior evidence based on 
sequence data is not permitted regardless of the evidence from gene ex- 
pression data. Despite relaxing this restriction in our model, our inferred 
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gene regulatory network remains quite sparse, as seen in Section 4.1. The 
popular "network component analysis" approach [Liao et al. (2003), Yang 
(2005), Boulesteix and Strimmer (2005)] also assumes that the relationships 
derived from ChIP binding data alone are known without error. This is a 
rather restrictive assumption, especially when one considers that ChIP ex- 
periments are typically limited to a single condition, but TF binding can 
vary across different conditions. In contrast, our model allows inferred rela- 
tionships based on strong gene expression evidence that are not completely 
evident based on our prior information (ChIP binding or promoter sequence 
data). Although TF expression is not a perfect proxy for TF activity, we 
believe it is the best experiment-specific measure of TF activity that our 
current data resources allow. Our model could certainly be further improved 
by using a more direct measure of TF activity, such as actual TF protein 
levels in the cell, but available data on TF protein levels is severly limited 
at this time. 

A fundamental element of our informed prior approach is that we actually 
have a choice between two prior distributions for our regulation indicators 
C, one informed prior based on ChIP binding data, and another based on 
promoter sequence data. Since we do not know from application to appli- 
cation (in this case, from transcription factor to transcription factor) which 
data source is more accurate, we introduce a variable weight that provides 
a balance between the two prior distributions. This weight variable w is 
itself assigned a noninformative uniform prior distribution, and we also as- 
sign noninformative prior distributions for our linear model parameters. The 
weighting of different sources of information in a Bayesian model is briefly 
mentioned by Berry and Hochberg (1999). Ibrahim and Chen (2000) and 
Chen et al. (2003) introduce the power prior distribution: a weight between 
their regression model likelihood and a prior distribution based on histor- 
ical data. In contrast, our weight variables are used as a balance between 
two "competing" prior distributions, which means that the estimated pos- 
terior distribution of each weight variable can shed substantial insight into 
the relative quality of our two sources of prior data. In fact, our weighted 
prior distribution can be interpreted as the combination of our two sources 
of prior information that best matches the likelihood distribution based on 
expression data. The results from our Yeast application indicate that our 
variable weight methodology achieves an appropriate balance between our 
two sources of prior information. Our results confirm the commonly-held 
belief that promoter sequence data is generally much less reliable than the 
ChIP binding data, although promoter sequence data can be useful in some 
cases. 
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APPENDIX: GIBBS SAMPLING IMPLEMENTATION 



The posterior distribution of our unknown parameters is proportional to 
the product of our model likelihood and our assumed prior distributions, 
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where Xijt = Cyfjt- We use the following Gibbs sampling [Geman (1984)] 
steps to estimate the joint posterior distribution of all unknown parameters. 

Step 1. Sampling linear model parameters 0. The regulation matrix C is 
assumed known during this step, so we do not need to use our prior data 
b,m or the current values of w. We use C to construct the variables X, 
where Xijt — Cijfjt- The linear model parameters are then separately 
estimated by the following iterative strategy. Note that in the steps below, 
we have combined our interaction coefficients jjk and linear coefficients (3j 
into a single set of parameters (3. Since each intercept Qj is independent 
from the other a's, they can be separately sampled, 
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We can make our prior distribution for each to be noninformative by 
making r a very large (in this study, 10000) relative to the contribution of 
the likelihood to the variance (<r 2 /T). 
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Our slope and interaction coefficients /3j 's are not independent from each 
other, and so must be iteratively sampled: 



p((3j\a,a 2 ,g,X) 



oc exp 



-1 

-1 



N T 



Oii 



2(J 2 



EE [an 

i=i t=i \ 
N T 



j y 

E Pj' x w 



■ exp 



2r 2 fi 



= exp 
where V it = gu - a 

p(Pj\a,a 2 ,g,X) oc exp 



exp 



— P 2 



i=l t=l 

J2j'^j Pj'Xift, which reduces further to 



-1 

2^ 



0j f ' Ty X 

a 1 



2i 



where vp = (T X x/v 2 + l/ r j; 
This distribution implies that 
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We can make our prior distribution for each (3j to be noninformative by 
making Tp very large (in this study, 10000) relative to the contribution of 
the likelihood to the variance {a 2 /Txx)- 

For the residual variance a 2 , we use a x 2 P r i° r distribution for a with 
hyper-parameter v = 2, which results in the following conditional distribu- 
tion: 
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where V a = J2i=iJ2t=i(9it ~ a i ~ J2j=i Pj x ijt) ■ We see that the influence 
of this prior is very small on the posterior distribution for a 2 , which is a 
scaled-inverse x 2 distribution with degrees of freedom parameter TN + 2 
and scale parameter s 2 = (V a + 1)/(TN + 2). 

Step 2. Sampling regulation indicators C . We are assuming that both our 
linear model parameters and our weights w are known for this step of the 
algorithm. When estimating a new value for each Cy , we also can condition 
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on C', which is all the other Ci>y values in C (i' and j' 7^ j). This gives 
us the following conditional distribution for Cj,-: 



p(Cij\®,w,C',g,f,b,m) 



(7) 
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Let Zi be the value of equation (7) when dj = 1 and Zq be the value of 
equation (7) when dj = 0. We sample a new value of Cj,- equal to 1 or 
with probabilities proportional to Z\ or Zq respectively. 

Step 3. Sampling prior weights w. We are assuming that the regulation 
matrix C is known for this step of the algorithm, so we do not need to use 
any of the expression data, g or linear model parameters for this step. 
For each TF j, we need to sample a new weight Wj based on the following 
distribution: 
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The normalizing constant A(wj) present in (8) comes from the integration 
of p(wj,Cj\b,m) over all configurations of C y. 
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We sample a new value Wj via grid sampling: we evaluate (8) over a fine grid 
of points in the unit interval, and sample one of these points with probability 
proportional to (8). Multiple chains of our Gibbs sampling algorithm were 
run from different starting points until we were confident that the chains had 
converged to the same range of values. Details of our convergence diagnostics 
are given in the supplemental materials. 
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